NONPARAMETRIC METHODS FOR INFERENCE IN THE
PRESENCE OF INSTRUMENTAL VARIABLES 

By Peter Hall and Joel L. Horowitz 1 

Australian National University and Northwestern University 

We suggest two nonparametric approaches, based on kernel methods
and orthogonal series, to estimating regression functions in the
presence of instrumental variables. For the first time in this class of
problems, we derive optimal convergence rates, and show that they 
are attained by particular estimators. In the presence of instrumental 
variables the relation that identifies the regression function also de- 
fines an ill-posed inverse problem, the "difficulty" of which depends 
on eigenvalues of a certain integral operator which is determined by 
the joint density of endogenous and instrumental variables. We de- 
lineate the role played by problem difficulty in determining both the 
optimal convergence rate and the appropriate choice of smoothing 
parameter. 

1. Introduction. Data $(X_i, Y_i)$ are observed, the pairs being generated
by the model

(1.1) $Y_i = g(X_i) + U_i,$

where $g$ is a function which we wish to estimate and the $U_i$'s denote
disturbances. The $U_i$'s are correlated with the explanatory variables $X_i$ and, in
particular, $E(U_i \mid X_i)$ does not vanish. For example, this may occur if a third
variable causes both $X_i$ and $Y_i$, but is not included in the model.

This circumstance arises frequently in economics. To illustrate, suppose
that $Y_i$ denotes the hourly wage of individual $i$, and that $X_i$ includes the
individual's level of education, among other variables. The "error" $U_i$ would
generally include personal characteristics, such as "ability," which influence
the individual's wage but are not observed by the analyst. If high-ability
individuals tend to choose high levels of education, then education is
correlated with ability, thereby causing $U_i$ to be correlated with at least some
components of $X_i$.

Suppose, however, that for each $i$ we have available another observed data
value, $W_i$, say (an instrumental variable), for which

(1.2) $E(U_i \mid W_i) = 0,$

and there is a "sufficiently strong" relationship between $X_i$ and $W_i$. Then
there is an opportunity for estimating $g$ from the data $(X_i, W_i, Y_i)$.

The formal definition of "sufficiently strong" will depend on the nature
of the problem. In a parametric setting, for example, where $g(X_i) = X_i\beta$,
$X_i$ is an $m \times k$ matrix and $\beta$ is a $k \times 1$ vector, "sufficiently strong" means
simply that the matrix of correlations between $X$ and $W$ is of full rank; this
is sometimes expressed as "$X$ and $W$ are fully correlated." In a nonparametric
setting the definition of "sufficiently strong" is given by, for example,
condition (2.1) below.

Estimation of $g$ is difficult because, as explained in Section 2, the relation
that identifies $g$ is a Fredholm equation of the first kind,

(1.3) $Tg = \phi,$

say, which leads to an ill-posed inverse problem [9, 14]. We use a ridge-type
regularization method to achieve boundedness of the relevant inverse
integral operator, and develop both kernel and series estimators of g. The 
resulting estimators have optimal L2 rates of convergence. 

Closely related inverse problems, where the context is rendered relatively 
abstract in order to facilitate solution, include those studied by Donoho [4], 
Johnstone [8] and Cavalier, Golubev, Picard and Tsybakov [2]. That work 
addresses the white-noise model, rather than the more explicitly realistic 
discrete-data setting of (1.1). In such treatments the operator T is generally 
assumed known, whereas in the case of instrumental-variables problems it 
usually must be estimated from data. Nevertheless, the optimal convergence 
rates obtained in the above earlier work are identical to our own. Indeed, 
the mean integrated squared error rates we obtain are the same as those in
an "ordinary" inverse problem, where $T$ is known and equal to $T_1^{*}T_1$, and $T_1$
is the nonstochastic transformation of the actual inverse model. Efromovich
and Koltchinskii [5] treated a white-noise model in a setting where $T$, at
(1.3), must be estimated, and also obtained optimal rates.

Research on this type of problem in econometrics is mostly very recent. 
Blundell and Powell [1] and Florens [6] discussed the relationship between 
(1.1) and other "structural" models in econometrics. Newey, Powell and Vella 
[13] investigated estimation and inference with a triangular-array version 
of (1.1). In that setup, equations relate $X_i$ and $W_i$, and the disturbances of
these equations are connected to $U_i$. Newey and Powell [12] proposed a series
estimator for g in (1.1) and gave sufficient conditions for its consistency, 
but did not obtain a rate of convergence. Darolles, Florens and Renault 
[3] developed a kernel estimator for a special case of (1.1) and obtained its 
rate of convergence. This rate is slower than that obtained here. However, 
Darolles, Florens and Renault [3] make assumptions that conflict with ours, 
and it is not known whether their rate is optimal under their assumptions. 

Further related work on inverse problems includes that of Wahba [17], 
Tikhonov and Arsenin [15], Groetsch [7], Nashed and Wahba [11] and Van 
Rooij and Ruymgaart [16]. 

We shall give a relatively detailed treatment, together with proofs, of 
results in the case where the instrumental variable is univariate. This setting 
is arguably of greatest interest to statisticians. Extensions to multivariate 
cases will be outlined. 

2. Model and estimators in bivariate case. 

2.1. Model. Let $(U_i, W_i, X_i, Y_i)$, for $i \ge 1$, be independent and identically
distributed 4-vectors, and assume they follow a model satisfying (1.1) and (1.2).
We shall suppose that $(W_i, X_i, Y_i)$, for $1 \le i \le n$, are observed, and that the
distribution of $(X_i, W_i)$ is confined to the unit square.

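Denote by $f_X$, $f_W$ and $f_{XW}$ the marginal densities of $X$ and $W$, and the
joint density of $X$ and $W$, respectively, and define the linear operator $T$ on
the space of square-integrable functions on $[0,1]$ by

$(T\psi)(z) = \int t(x,z)\,\psi(x)\,dx, \quad\text{where}\quad t(x,z) = \int f_{XW}(x,w)\,f_{XW}(z,w)\,dw.$

The following assumption characterizes the strength of association we
require between $X$ and $W$:

(2.1) $T$ is nonsingular.

To appreciate the nature of (2.1), observe that if $X$ and $W$ are independent,
then $T$ maps each function $\psi$ to a constant multiple of $f_X$, and so (2.1)
fails. However, if (2.1) holds, then since it may be proved from (1.1) and (1.2)
that

(2.2) $E_W\{E(Y \mid W)\,f_{XW}(z,W)\} = (Tg)(z),$

where $E_W$ denotes the expectation operator with respect to the distribution
of $W$, we have

(2.3) $g(z) = E_W\{E(Y \mid W)\,(T^{-1}f_{XW})(z,W)\}.$
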
This property suggests an estimator, which we shall develop in Section 2.2.

Observe that (2.2) is a Fredholm equation of the first kind, and generates
an ill-posed inverse problem if, as is usually the case, zero is a limit point
of the eigenvalues of $T$. In that case, $T^{-1}$ is not a bounded, continuous
operator. For the purpose of estimation, we shall deal with this problem in
Section 2.2 by replacing $T^{-1}$ by $(T + a_n)^{-1}$, where $a_n$ is a positive ridge
parameter converging to zero as $n \to \infty$.
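
The effect of the ridge can be seen numerically. The following is a minimal sketch, not the authors' code: it discretizes $T$ on a midpoint grid for a hypothetical joint density (the cosine-series density, its coefficients `0.3 / j**2`, and the names `discretized_operator` and `ridge_inverse_norm` are all assumptions made for illustration). The eigenvalues of the discretized $T$ decay toward zero, and the norm of the ridged inverse stays below $1/a_n$.

```python
import numpy as np

def discretized_operator(m=200, n_terms=20):
    """Midpoint-rule discretization of T for a hypothetical density
    f_XW(x, w) = 1 + sum_j q_j * 2 cos(j pi x) cos(j pi w), q_j = 0.3 / j^2.
    The density is positive because 2 * sum_j q_j < 1."""
    grid = (np.arange(m) + 0.5) / m
    j = np.arange(1, n_terms + 1)
    q = 0.3 / j**2
    chi = np.sqrt(2.0) * np.cos(np.pi * np.outer(grid, j))  # chi[a, j-1]
    F = 1.0 + (chi * q) @ chi.T          # F[a, b] = f_XW(x_a, w_b)
    # t(x_a, x_c) ~ sum_b F[a, b] F[c, b] / m, and one more 1/m for the dx integral
    T = F @ F.T / m**2
    return grid, F, T

def ridge_inverse_norm(T, a_n):
    """Spectral norm of (T + a_n I)^{-1}."""
    return np.linalg.norm(np.linalg.inv(T + a_n * np.eye(T.shape[0])), 2)

grid, F, T = discretized_operator()
eigvals = np.linalg.eigvalsh(T)[::-1]    # descending order
for a_n in (1e-1, 1e-2, 1e-3):
    print(a_n, ridge_inverse_norm(T, a_n))
```

Without the ridge, the smallest eigenvalues of the discretized operator are numerically zero, so a direct inverse would amplify any noise in the right-hand side without bound.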

2.2. Generalized kernel estimator. Let $f_{XW}$ have $r$ continuous derivatives
with respect to any combination of its arguments. Let $K_h(\cdot,\cdot)$ denote
a generalized kernel function, with the properties $K_h(u,t) = 0$ if $u > t$ or
$u < t - 1$, and, for all $t \in [0,1]$,

(2.4) $h^{-(j+1)} \int_{t-1}^{t} u^j K_h(u,t)\,du = \begin{cases} 1, & \text{if } j = 0, \\ 0, & \text{if } 1 \le j \le r-1. \end{cases}$



Here, $h > 0$ denotes a bandwidth, and the kernel is considered in generalized
form only to overcome edge effects. In particular, if $h$ is small and $t$ is not
close to either 0 or 1, then we may take $K_h(u,t) = K(u/h)$, where $K$ is an
$r$th order kernel. If $t$ is close to 1, then we may take $K_h(u,t) = L(u/h)$,
where $L$ is a bounded, compactly supported function satisfying

$\int_0^\infty u^j L(u)\,du = \begin{cases} 1, & \text{if } j = 0, \\ 0, & \text{if } 1 \le j \le r-1. \end{cases}$

And if $t$ is close to 0, then we may take $K_h(u,t) = L(-u/h)$. There are, of
course, other ways of overcoming the edge-effect problem, but the "boundary
kernel" approach above is also appropriate.

We require two estimators of $f_{XW}$, the second a leave-one-out estimator,

$\hat f_{XW}(x,w) = \frac{1}{nh^2} \sum_{i=1}^{n} K_h(x - X_i, x)\,K_h(w - W_i, w),$

$\hat f_{XW}^{(-i)}(x,w) = \frac{1}{(n-1)h^2} \sum_{1 \le j \le n,\ j \ne i} K_h(x - X_j, x)\,K_h(w - W_j, w).$

We use $\hat f_{XW}$ to construct the following estimators of $t(x,z)$ and the
transformation $T$:

$\hat t(x,z) = \int \hat f_{XW}(x,w)\,\hat f_{XW}(z,w)\,dw, \qquad (\hat T\psi)(z) = \int \hat t(x,z)\,\psi(x)\,dx.$

Let $a_n > 0$; we shall use it as a ridge parameter when inverting $\hat T$, defining
$\hat T^{+} = (\hat T + a_n I)^{-1}$, where $I$ is the identity operator. Reflecting (2.3), our
estimator of $g$ is

$\hat g(z) = \frac{1}{n} \sum_{i=1}^{n} (\hat T^{+}\hat f_{XW}^{(-i)})(z, W_i)\,Y_i.$
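
The construction above can be sketched on a grid as follows. This is an illustrative discretization under stated assumptions, not the authors' implementation: integrals are replaced by midpoint sums on `m` grid points, the kernel is a plain second-order Epanechnikov kernel with no boundary correction (so edge effects are not handled as in (2.4)), and the function name `kernel_iv_estimate` and its arguments are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Second-order Epanechnikov kernel, K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_iv_estimate(X, W, Y, h, a_n, m=100):
    """Grid approximation to g_hat(z) = n^{-1} sum_i (T_hat^+ f_hat^{(-i)})(z, W_i) Y_i.
    Assumes data supported on [0, 1]; no boundary kernel is used."""
    X, W, Y = (np.asarray(v, dtype=float) for v in (X, W, Y))
    n = X.size
    grid = (np.arange(m) + 0.5) / m          # midpoints of [0, 1]
    step = 1.0 / m                           # quadrature weight

    Kx = epanechnikov((grid[:, None] - X[None, :]) / h)   # (m, n): K((x_a - X_j)/h)
    Kw = epanechnikov((grid[:, None] - W[None, :]) / h)   # (m, n): K((w_b - W_j)/h)

    F = Kx @ Kw.T / (n * h**2)        # F[a, b] = f_hat(x_a, w_b)
    t_hat = F @ F.T * step            # t_hat[a, c] ~ int f_hat(x_a, w) f_hat(x_c, w) dw
    T_hat = t_hat * step              # matrix acting as psi -> int t_hat(x, z) psi(x) dx

    Kww = epanechnikov((W[:, None] - W[None, :]) / h)     # (n, n): K((W_i - W_j)/h)
    np.fill_diagonal(Kww, 0.0)                            # leave observation i out
    F_loo = Kx @ Kww.T / ((n - 1) * h**2)  # F_loo[a, i] = f_hat^{(-i)}(x_a, W_i)

    # T_hat^+ is linear and does not depend on i, so it can be applied once
    rhs = F_loo @ Y / n
    g_hat = np.linalg.solve(T_hat + a_n * np.eye(m), rhs)
    return grid, g_hat
```

Because the ridged operator is symmetric and positive definite for any $a_n > 0$, the linear solve is always well defined, whatever the sample.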

An alternative approach would be to develop a spectral expansion of T, 
truncate it to a finite series, and invert this series. The smoothing parameter 
now becomes the number of terms in the series, rather than the ridge, a n . 
Theory may be developed for this "spectral cut-off" approach, too. However, 
it appears to require regularity conditions on spacings between adjacent 
eigenvalues of T, as well as a condition on their rate of decrease (see A. 3 in 
Section 4.1), and for this reason we do not pursue it here. 

2.3. Orthogonal series estimator. This technique is based on empirically
transforming the marginal distributions of $W$ and $X$ to uniform, and exploiting
the relatively simple character of the problem in that case. To appreciate
this point, assume for the time being that both marginals are in fact uniform
on $[0,1]$, and let $\chi_1, \chi_2, \ldots$ denote an orthonormal basis for $L_2[0,1]$. In
practice, one would usually take $\{\chi_j\}$ to be the cosine sequence, although
there are many other options.

Let $f_{XW}(x,w) = \sum_j\sum_k q_{jk}\,\chi_j(x)\,\chi_k(w)$ denote the generalized Fourier
expansion of $f_{XW}$, and put $Q = (q_{jk})$, $p_j = E\{Y\chi_j(W)\}$, $\gamma_j = E\{g(X)\chi_j(X)\}$,
$p = (p_j)$ and $\gamma = (\gamma_j)$, the latter two quantities being column vectors. By
(1.1) and (1.2), $QQ'\gamma = Qp$ and, therefore, $\gamma = (QQ')^{-1}Qp$. [This is really
another way of writing (2.3); observe that the operator $T$ takes $g$ to a function
of which the $j$th Fourier coefficient is $(QQ'\gamma)_j$.] Hence, the problem of
estimating the Fourier coefficients $\gamma_j$ of $g$ reduces to one of estimating $p_j$
and $q_{jk}$.

Next we describe how to solve the latter problem in general cases, where
marginal distributions are not uniform. First transform the marginals, by
computing $\hat W_i = \hat F_W(W_i)$ and $\hat X_i = \hat F_X(X_i)$, where $\hat F_W$ and $\hat F_X$ denote the
empirical distribution functions of the data $W_1, \ldots, W_n$ and $X_1, \ldots, X_n$,
respectively. Put $\hat q_{jk} = n^{-1}\sum_i \chi_j(\hat X_i)\,\chi_k(\hat W_i)$ and $\hat p_j = n^{-1}\sum_i \chi_j(\hat W_i)\,Y_i$. Let $\hat Q$
be the $m \times m$ matrix that has $\hat q_{jk}$ in position $(j,k)$, and set

$\hat\gamma = (\hat\gamma_j) = (\hat Q\hat Q' + a_n I_m)^{-1}\hat Q\hat p,$

where $a_n$ denotes a ridge parameter and $I_m$ is the $m \times m$ identity. Our
estimator of $g$ is

$\hat g(x) = \sum_{j=1}^{m} \hat\gamma_j\,\chi_j(x).$

In this estimator the number of terms, $m$, in the approximating Fourier
series is the main smoothing parameter. It is relatively awkward to derive
theory for the orthogonal series method, owing to the fact that the transformed
data $\hat W_i$ and $\hat X_i$ are not independent, and to the difficulty of dealing
theoretically with the large random matrix $\hat Q$. Nevertheless, we shall show
in Section 4 that, under restrictions, the orthogonal series technique has
optimal performance.
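
The steps of the series method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the basis is the orthonormal cosine sequence with a constant first term, the indexing convention is $q_{jk} \approx n^{-1}\sum_i \chi_j(\hat X_i)\chi_k(\hat W_i)$ to match the expansion of $f_{XW}$, ties in the data are ignored, and all function names are hypothetical. The returned function takes arguments on the transformed (uniform) scale.

```python
import numpy as np

def cosine_basis(u, m):
    """First m orthonormal cosine functions on [0, 1]:
    chi_1 = 1 and chi_j(u) = sqrt(2) cos((j - 1) pi u) for j >= 2."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    j = np.arange(m)
    B = np.sqrt(2.0) * np.cos(np.pi * np.outer(u, j))
    B[:, 0] = 1.0
    return B                                   # shape (len(u), m)

def empirical_cdf_transform(v):
    """F_hat(v_i) = rank(v_i) / n, assuming no ties."""
    v = np.asarray(v)
    ranks = np.argsort(np.argsort(v)) + 1      # ranks 1..n
    return ranks / v.size

def series_iv_estimate(X, W, Y, m, a_n):
    """Ridge-regularized series estimate of the Fourier coefficients of g."""
    Y = np.asarray(Y, dtype=float)
    n = Y.size
    BX = cosine_basis(empirical_cdf_transform(X), m)
    BW = cosine_basis(empirical_cdf_transform(W), m)
    Q = BX.T @ BW / n                          # Q[j, k] ~ mean of chi_j(X_i) chi_k(W_i)
    p = BW.T @ Y / n                           # p[j] ~ mean of chi_j(W_i) Y_i
    gamma = np.linalg.solve(Q @ Q.T + a_n * np.eye(m), Q @ p)

    def g_hat(u):
        # u is on the transformed scale, i.e. u = F_X(x)
        return cosine_basis(u, m) @ gamma

    return gamma, g_hat
```

In this sketch both `m` and `a_n` are supplied by the user; the text treats `m` as the main smoothing parameter.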

3. Model and estimators in the multivariate case. In the model at (1.1) 
the explanatory variable X is endogenous, that is, determined within the 
model. When the model is multivariate, there is an opportunity for divid- 
ing the explanatory variable, which is now a vector, into two parts, one 
endogenous and the other determined outside the model, or exogenous. 

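We take $(Y, X, Z, W, U)$ to be a vector, where $Y$ and $U$ are scalars, $X$
and $W$ are supported on $[0,1]^p$, and $Z$ is supported on $[0,1]^q$. Generalizing
(1.1) and (1.2), the model is

$Y_i = g(X_i, Z_i) + U_i, \qquad E(U_i \mid Z_i, W_i) = 0,$

where $(Y_i, X_i, Z_i, W_i, U_i)$, for $i \ge 1$, are independent and identically
distributed as $(Y, X, Z, W, U)$. Thus, $X$ and $Z$ are endogenous and exogenous
explanatory variables, respectively. Data $(Y_i, X_i, Z_i, W_i)$, for $1 \le i \le n$, are
observed.

Let $f_{XZW}$ denote the density of $(X, Z, W)$, write $f_Z$ for the density of $Z$,
and for each $x_1, x_2 \in [0,1]^p$ put

$t_z(x_1, x_2) = \int f_{XZW}(x_1, z, w)\,f_{XZW}(x_2, z, w)\,dw,$

the analogue of $t(x_1, x_2)$ in Section 2. Define the operator $T_z$ on $L_2[0,1]^p$ by

$(T_z\psi)(x, z, w) = \int t_z(\xi, x)\,\psi(\xi, z, w)\,d\xi.$

Analogously to (2.3), it may be proved that, for each $z$ for which $T_z^{-1}$ exists,

$g(x,z) = f_Z(z)\,E_{W \mid Z}\{E(Y \mid Z = z, W)\,(T_z^{-1}f_{XZW})(x, z, W) \mid Z = z\},$

where $E_{W \mid Z}$ denotes the expectation operator with respect to the distribution
of $W$ conditional on $Z$. In this formulation, $(T_z^{-1}f_{XZW})(x, z, W)$
denotes the result of applying $T_z^{-1}$ to the function $f_{XZW}(\cdot, z, W)$ and
evaluating the resulting function at $x$.

To construct an estimator of $g(x,z)$, given $h > 0$ and $p$-vectors $x = (x^{(1)}, \ldots, x^{(p)})$
and $\xi = (\xi^{(1)}, \ldots, \xi^{(p)})$, let $K_{p,h}(x,\xi) = \prod_{1 \le j \le p} K_h(x^{(j)}, \xi^{(j)})$,
put $K_{q,h}(z,\zeta)$ analogously for $q$-vectors $z$ and $\zeta$, let $h_x, h_z > 0$, and define

$\hat f_{XZW}(x,z,w) = \frac{1}{n h_x^{2p} h_z^{q}} \sum_{i=1}^{n} K_{p,h_x}(x - X_i, x)\,K_{q,h_z}(z - Z_i, z)\,K_{p,h_x}(w - W_i, w),$

$\hat f_{XZW}^{(-i)}(x,z,w) = \frac{1}{(n-1) h_x^{2p} h_z^{q}} \sum_{1 \le j \le n,\ j \ne i} K_{p,h_x}(x - X_j, x)\,K_{q,h_z}(z - Z_j, z)\,K_{p,h_x}(w - W_j, w),$

$\hat t_z(x_1, x_2) = \int \hat f_{XZW}(x_1, z, w)\,\hat f_{XZW}(x_2, z, w)\,dw$

and

$(\hat T_z\psi)(x, z, w) = \int \hat t_z(\xi, x)\,\psi(\xi, z, w)\,d\xi,$

where $\psi$ is a function from $R^{p+q}$ to the real line. Then the estimator of
$g(x,z)$ is

$\hat g(x,z) = \frac{1}{n} \sum_{i=1}^{n} (\hat T_z^{+}\hat f_{XZW}^{(-i)})(x, z, W_i)\,Y_i\,K_{q,h_z}(z - Z_i, z).$

Here $\hat T_z^{+}$ denotes the ridged inverse of $\hat T_z$, defined by analogy with $\hat T^{+}$ in Section 2.2.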

4. Theoretical properties. 

4.1. Kernel method for bivariate case. The invertibility of $T$ is central
to our ability to successfully resolve $g$ from data, and so it comes as no
surprise to find that rates of convergence of estimators of $g$ hinge on the
rate at which the eigenvalues of $T$, say $\lambda_1 \ge \lambda_2 \ge \cdots > 0$, converge to 0.
Therefore, our regularity conditions will be framed in terms of an
eigen-expansion representation of $T$. To this end, let $\phi_j$ denote an eigenfunction
of $T$ with eigenvalue $\lambda_j$, normalized so that $\phi_1, \phi_2, \ldots$ is an orthonormal
basis for the space of square-integrable functions on the interval $[0,1]$. Then
we may write

$t(x,z) = \sum_{j=1}^{\infty} \lambda_j\,\phi_j(x)\,\phi_j(z),$

(4.1) $f_{XW}(x,z) = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} d_{jk}\,\phi_j(x)\,\phi_k(z),$

$g(x) = \sum_{j=1}^{\infty} b_j\,\phi_j(x),$

where $d_{jk}$ and $b_j$ denote generalized Fourier coefficients of $f_{XW}$ and $g$,
respectively.

Next we state regularity conditions. Assumption A.1 is equivalent to the
intersection of (1.1) and (1.2); A.3 gives smoothness conditions, expressed
through the eigen-expansion of $T$; A.2 and A.3 together imply that $T$ is a
bounded Hilbert-Schmidt operator and, hence, compact; and A.4 describes
the sizes of tuning parameters. The invertibility condition (2.1) is equivalent
to asking that each $\lambda_j > 0$, which in turn implies part of A.3.

Below, in condition A.3, we shall introduce constants $\alpha, \beta > 0$, for which



Therefore, it is possible to choose an integer $r > A_1$ and a constant $\gamma \in
[A_2, A_3]$; such values will be used below. Let $C > 0$ be an arbitrarily large
but fixed constant, let $\alpha, \beta > 0$, and denote by $\mathcal{G} = \mathcal{G}(C, \alpha, \beta)$ the class of
distributions $G$ of $(X, W, Y)$ that satisfy A.1-A.3 below.

Regarding the smoothness assumed of $f_{XW}$ in A.2, we mention that our
minimax rates do not alter if $f_{XW}$ is smoother than specified. The rates
are optimized for smoothness of $g$, given enough smoothness of $f_{XW}$. In
condition A.3, the lower bound on $\alpha$ seems difficult to relax and, in fact, it
has close analogues in related contexts, for example, in work on convergence
rates in functional data problems.

The upper bound on $\alpha$, however, seems more likely to be tied to our
method of proof. One approach to relaxing the bound might be to draw
inspiration from a modified approach to Tikhonov regularization (see [10])
and use, as the ridged inverse, $(T + a_n D^{2\beta-1})^{-1}$ rather than $(T + a_n I)^{-1}$.
Here, if $2\beta - 1$ were an integer, $D^{2\beta-1}$ would denote the $(2\beta - 1)$st power
of the differential operator; if $2\beta - 1$ were strictly greater than its integer
part, $\ell$ say, then $D^{2\beta-1}$ would involve taking the convolution of $g^{(\ell)}(t) -
g^{(\ell)}(0)$ against the kernel $|t|^{\ell-2\beta}$. However, this approach requires a direct
relationship between the smoothness of $g$, as expressed through the size of
$\beta$ in the formula $|b_j| \le Cj^{-\beta}$, and its smoothness in the more conventional
sense of differentiation. We have avoided making assumptions about this
relationship. In particular, as our results are presently formulated, $g$ does
not need to be continuous, let alone differentiable, no matter how large or
small $\beta$ might be.

A.1. The data $(X_i, W_i, Y_i)$ are independent and identically distributed as
$(X, W, Y)$, where $(X, W)$ is supported on $[0,1]^2$ and $E\{Y - g(X) \mid W = w\} = 0$.

A.2. The distribution of $(X, W)$ has a density, $f_{XW}$, with $r$ derivatives
(when viewed as a function restricted to $[0,1]^2$) bounded uniformly in
absolute value by $C$; and the functions $E(Y^2 \mid W = w)$ and $E(Y^2 \mid X = x, W = w)$
are bounded uniformly by $C$.

A.3. The constants $\alpha$ and $\beta$ satisfy $\alpha > 1$, $\beta > \frac{1}{2}$ and $\beta - \frac{1}{2} < \alpha < 2\beta$.
Moreover, $|b_j| \le Cj^{-\beta}$, $j^{-\alpha} \le C\lambda_j$ and $\sum_{k \ge 1} |d_{jk}| \le Cj^{-\alpha/2}$ for all $j \ge 1$.

A.4. The parameters $a_n$ and $h$ satisfy $a_n \asymp n^{-\alpha/(2\beta+\alpha)}$ and $h \asymp n^{-\gamma}$ as
$n \to \infty$, where $c_n \asymp d_n$ for positive constants $c_n$ and $d_n$ means that $c_n/d_n$ is
bounded away from zero and infinity.

A.5. The function $K_h(\cdot,\cdot)$ satisfies (2.4); for each $t \in [0,1]$, $K_h(h\,\cdot, t)$ is
supported on $[(t-1)/h, t/h] \cap \mathcal{K}$, where $\mathcal{K}$ is a compact interval not
depending on $t$; and

$\sup_{h > 0,\ t \in [0,1],\ u \in \mathcal{K}} |K_h(hu, t)| < \infty.$

Theorem 4.1. As $n \to \infty$,

$\sup_{G \in \mathcal{G}} \int_0^1 E_G\{\hat g(t) - g(t)\}^2\,dt = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

More generally, it may be proved that if a particular distribution of
$(X, W, Y)$ satisfies A.1, and if $E(Y^2) < \infty$ and the density $f_{XW}$ is
continuous on $[0,1]^2$, then $a_n$ and $h$ can be chosen so that $\int E_G(\hat g - g)^2 \to 0$ as
$n \to \infty$. Similar results, guaranteeing consistent estimation but without a
convergence rate, may be derived in the settings of Sections 4.2 and 4.3.
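
To make the role of problem difficulty concrete, the exponents in A.4 and Theorem 4.1 can be tabulated for given smoothness constants. The helper below is a small illustrative sketch; the function name `rate_exponents` is hypothetical, and the admissibility check encodes the inequalities of A.3 as they are read here (strict inequalities are an assumption of this sketch).

```python
from fractions import Fraction

def rate_exponents(alpha, beta):
    """Return (ridge, mise) such that a_n ~ n^{-ridge} and the minimax
    integrated squared error rate is n^{-mise}, per A.4 and Theorem 4.1."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    half = Fraction(1, 2)
    # constraints as read from A.3 (assumed strict in this sketch)
    if not (alpha > 1 and beta > half and beta - half < alpha < 2 * beta):
        raise ValueError("(alpha, beta) violates the constraints read from A.3")
    ridge = alpha / (2 * beta + alpha)
    mise = (2 * beta - 1) / (2 * beta + alpha)
    return ridge, mise

# Larger alpha (faster eigenvalue decay, a harder problem) gives a slower rate.
for alpha in (2, 3):
    print(alpha, rate_exponents(alpha, 2))
```

For fixed $\beta$, increasing $\alpha$ lowers the exponent $(2\beta-1)/(2\beta+\alpha)$, which is the sense in which faster eigenvalue decay makes the problem more difficult.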

4.2. Orthogonal series method for bivariate case. We shall simplify theory
by assuming the Fourier coefficients satisfy a strong diagonality condition.
Under this assumption it is sufficient to work with a strongly diagonal
form of $\hat Q$, where we redefine $\hat q_{jk} = 0$ if $|j - k| > N$ (where $N$ is permitted
to increase slowly with $n$), and leave $\hat q_{jk}$ unchanged otherwise. With this
alteration to $\hat q_{jk}$, let $\hat Q = (\hat q_{jk})$ be the indicated $m \times m$ matrix.

Recall from Section 2.3 that $\chi_1, \chi_2, \ldots$ is an orthonormal basis for $L_2[0,1]$.
Let $F_W$ and $F_X$ denote the marginal distribution functions of $W$ and $X$,
put $\tilde W = F_W(W)$ and $\tilde X = F_X(X)$, and let $f_{\tilde W\tilde X}$ denote the joint density
of $(\tilde W, \tilde X)$. Write $f_{\tilde W\tilde X}(w,x) = \sum_j\sum_k q_{jk}\,\chi_j(x)\,\chi_k(w)$ and $g(x) = \sum_j b_j\,\chi_j(x)$
for the generalized Fourier transforms of these functions. Recall that we
require the transformation represented by $QQ'$ to be invertible, so we may
define $Q^{-1} = (q_{jk}^{(-1)})$ to be a generalized inverse of $Q$.

Given constants $\alpha > 2$, $\beta > \frac{1}{2}$ and $C_1, C_2 > 0$, let $\mathcal{H} = \mathcal{H}(C_1, C_2, \alpha, \beta)$
denote the class of distributions $G$ of $(W, X, Y)$ for which

$E\{Y - g(X) \mid W = w\} = 0, \qquad |q_{jk}| \le C_1\{\max(j,k)\}^{-\alpha/2}\exp(-C_2|j-k|),$

$|q_{jk}^{(-1)}| \le C_1\{\max(j,k)\}^{\alpha/2}\exp(-C_2|j-k|),$

$|b_j| \le C_1 j^{-\beta}, \qquad E(Y^2) \le C_1,$

where the bounds are assumed to hold uniformly in $1 \le j, k < \infty$.

Theorem 4.2. Let $\{\chi_j\}$ denote the orthonormalized version of the cosine
series on $[0,1]$. Take $\alpha > 2$ and $\beta > \frac{1}{2}$, and assume $a_n \asymp m^{-\alpha}$,
$m \asymp n^{1/(2\beta+\alpha)}$, $N/\log n \to \infty$ and $N = O(n^{\epsilon})$ for all $\epsilon > 0$. Then, as $n \to \infty$,

$\sup_{G \in \mathcal{H}} \int_0^1 E_G(\hat g - g)^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

4.3. Kernel method for multivariate case. For each $z \in [0,1]^q$, let $\{\phi_{z1},
\phi_{z2}, \ldots\}$ denote the orthonormalized sequence of eigenvectors, and $\lambda_{z1} \ge
\lambda_{z2} \ge \cdots > 0$ the respective eigenvalues of the operator $T_z$. Assume that
$\{\phi_{zj}\}$ forms an orthonormal basis of $L_2[0,1]^p$. Analogously to (4.1),

$t_z(x_1, x_2) = \sum_{j=1}^{\infty} \lambda_{zj}\,\phi_{zj}(x_1)\,\phi_{zj}(x_2),$

$f_{XZW}(x,z,w) = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty} d_{zjk}\,\phi_{zj}(x)\,\phi_{zk}(w),$

$g(x,z) = \sum_{j=1}^{\infty} b_{zj}\,\phi_{zj}(x),$

where the $d_{zjk}$'s and $b_{zj}$'s are generalized Fourier coefficients.

Put $\tau = 2r/(2r + q)$. If $\alpha, \beta > 0$ denote constants satisfying MV.3 below,
then

2a + 2(3 - 1 



B\ = max< p 



2(3 -a ' 

2 2a + 10/3 + l 6 \~ 1 (2a + 2(3- 1 3q 



5p 2(3 + a hp) \ 2(3 + a + 5p/' 2 j >0 ' 

n _ r 2a + 2(3 - 1 _ f r 2/3 - a 1 / 10/3 + 2a 

< i5o = ^ < Bi = mm< , — r — 

2r 2/3 + a ~ \2p2(3 + a hp\ 2(3 + a 

Choose $r > B_1$ and $\gamma \in [B_2, B_3]$. We make the following assumptions, of
which the first five are respectively analogous to A.1-A.5 in Section 4.1. Let
$C > 0$.

MV.1. The data $(X_i, W_i, Z_i, Y_i)$ are independent and identically distributed
as $(X, W, Z, Y)$, where $X$, $W$ and $Z$ are supported on $[0,1]^p$, $[0,1]^p$ and $[0,1]^q$,
respectively, and $E\{Y - g(X,Z) \mid Z = z, W = w\} = 0$.

MV.2. The distribution of $(X, Z, W)$ has a density, $f_{XZW}$, with $r$ derivatives
of all types (when viewed as a function restricted to $[0,1]^{2p+q}$), each
derivative bounded in absolute value by $C$; $g(x,z)$ and $b_{zj}$ have $r$ partial
derivatives with respect to $z$, bounded in absolute value by $C$, uniformly
in $x$ and $z$; and the functions $E(Y^2 \mid Z = z, W = w)$ and $E(Y^2 \mid X = x, Z =
z, W = w)$ are bounded uniformly by $C$.

MV.3. The constants $\alpha, \beta$ satisfy $\alpha > 1$, $\beta > \frac{1}{2}$ and $\beta - \frac{1}{2} < \alpha < 2\beta$.
Moreover, $|b_{zj}| \le Cj^{-\beta}$, $j^{-\alpha} \le C\lambda_{zj}$ and $\sum_{k \ge 1} |d_{zjk}| \le Cj^{-\alpha/2}$, uniformly
in $z \in [0,1]^q$, for all $j \ge 1$.

MV.4. The parameters $a_n$, $h_x$ and $h_z$ satisfy $a_n \asymp n^{-\alpha\tau/(2\beta+\alpha)}$, $h_x \asymp n^{-\gamma}$
and $h_z \asymp n^{-1/(2r+q)}$ as $n \to \infty$.

MV.5. The function $K_h(\cdot,\cdot)$ satisfies A.5.

MV.6. For each $z \in [0,1]^q$, the functions $\phi_{zj}$ form an orthonormal basis
for $L_2[0,1]^p$, and $\sup_x \sup_z \max_j |\phi_{zj}(x)| < \infty$.

Let $\mathcal{M} = \mathcal{M}(C, \alpha, \beta)$ denote the class of distributions of $(X, W, Z, Y)$ that
satisfy MV.1-MV.3 and MV.6.

Theorem 4.3. As $n \to \infty$,

$\sup_{G \in \mathcal{M}}\ \sup_{z \in [0,1]^q} \int_{[0,1]^p} E_G\{\hat g(x,z) - g(x,z)\}^2\,dx = O(n^{-\tau(2\beta-1)/(2\beta+\alpha)}).$

4.4. Optimality. The convergence rates expressed by Theorems 4.1-4.3
are optimal in those contexts, in a minimax sense. Indeed, let $\tilde g$ denote any
measurable functional of the data which is itself a measurable function on
$[0,1]$ (in the cases of Theorems 4.1 and 4.2) or on $[0,1]^p$ (in the setting of
Theorem 4.3); let $\mathcal{C}$ denote $\mathcal{G}$, $\mathcal{H}$ or $\mathcal{M}$ in the cases of Theorems 4.1-4.3,
respectively; and put $\tau = 1$ in the contexts of Theorems 4.1 and 4.2, and
$\tau = 2r/(2r + q)$ for Theorem 4.3.

Theorem 4.4.

(4.2) $\liminf_{n \to \infty}\ n^{\tau(2\beta-1)/(2\beta+\alpha)}\ \inf_{\tilde g}\ \sup_{G \in \mathcal{C}} \int E_G(\tilde g - g)^2 > 0.$

In the multivariate setting of Section 4.3 we interpret the integral at (4.2)
as

$\int_{[0,1]^p} E_G\{\tilde g(x,z) - g(x,z)\}^2\,dx,$

and interpret Theorem 4.4 as stating that, for this representation, (4.2) holds
for each $z \in [0,1]^q$.

5. Monte Carlo experiments. This section reports the results of a Monte
Carlo investigation of the finite-sample performance of the kernel estimator
for the bivariate model. The estimator is the one described in Section 2.2,
although our method is not optimized for theoretical performance. In
particular, we took $K$ to be a second-order kernel.

Fig. 1. Density of $X$ and $W$ used in Monte Carlo experiments.



Samples of size $n = 200$ were generated from the model determined by

$f_{XW}(x,w) = 2C_f \sum_{j=1}^{\infty} (-1)^{j+1} j^{-1} \sin(j\pi x)\sin(j\pi w), \qquad 0 \le x, w \le 1;$

$g(x) = 2^{1/2} \sum_{j=1}^{\infty} (-1)^{j+1} j^{-2} \sin(j\pi x), \qquad Y = E\{g(X) \mid W = w\} + V,$

where $C_f$ is a normalization constant and $V$ is distributed as Normal $N(0, 0.01)$.
For computational purposes, the infinite series were truncated at $j = 100$.
Figure 1 shows a graph of the marginal distributions of $X$ and $W$, which
are identical. The solid line in Figure 2 depicts $g(x)$. The kernel function is
the Epanechnikov kernel, $K(x) = 0.75(1 - x^2)$ for $|x| \le 1$.
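
The truncated series can be evaluated directly. The sketch below is illustrative only (the original experiments were run in GAUSS, and the function names here are hypothetical): it computes the normalizing constant $C_f$ from the exact integrals of the sine terms and evaluates the truncated $f_{XW}$ and $g$ with NumPy.

```python
import numpy as np

J = 100  # truncation point stated in the text

def normalizing_constant(J=J):
    """C_f such that the truncated series for f_XW integrates to one.
    Uses int_0^1 sin(j pi x) dx = (1 - cos(j pi)) / (j pi)."""
    j = np.arange(1, J + 1)
    s = (1.0 - np.cos(j * np.pi)) / (j * np.pi)
    total = 2.0 * np.sum((-1.0) ** (j + 1) / j * s**2)
    return 1.0 / total

def f_xw_grid(x, w, J=J):
    """Truncated f_XW evaluated on the grid x (rows) by w (columns)."""
    j = np.arange(1, J + 1)
    Sx = np.sin(np.pi * np.outer(x, j))
    Sw = np.sin(np.pi * np.outer(w, j))
    coef = (-1.0) ** (j + 1) / j
    return 2.0 * normalizing_constant(J) * (Sx * coef) @ Sw.T

def g_true(x, J=J):
    """Truncated series for g(x)."""
    j = np.arange(1, J + 1)
    coef = (-1.0) ** (j + 1) / j**2
    return np.sqrt(2.0) * np.sin(np.pi * np.outer(x, j)) @ coef

pts = np.linspace(0.05, 0.95, 19)   # the 19 evaluation points used in the text
F = f_xw_grid(pts, pts)
g_vals = g_true(pts)
```

Since the series for $f_{XW}$ is symmetric in its two arguments, the marginal densities of $X$ and $W$ coincide, in line with the statement about Figure 1.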

Each experiment consisted of estimating g at the 19 points, x = 0.05,0.10, 
. . . , 0.95. The experiments were carried out in GAUSS using GAUSS pseudo- 
random number generators. There were 1000 Monte Carlo replications in 
each experiment. 

Table 1 shows the performance of the estimator, $\hat g$, as a function of the
bandwidth, $h$, and the ridge parameter, $a_n$. The quantities Bias$^2$, Var and
MSE in the table were calculated as the averages, over the 19 values of
$x$, of Monte Carlo approximations to pointwise squared bias, variance and
mean squared error, respectively, at those points; the pointwise values were
computed by averaging over the 1000 Monte Carlo simulations.
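
The summary quantities described above can be computed from an array of simulated estimates as in the following sketch. The function name `mc_summary` and the array layout (replications in rows, evaluation points in columns) are assumptions made for illustration.

```python
import numpy as np

def mc_summary(estimates, truth):
    """estimates: (R, P) array of g_hat at P points over R replications.
    truth: (P,) array of g at the same points.
    Returns averages over the P points of squared bias, variance and MSE."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    bias2 = (estimates.mean(axis=0) - truth) ** 2
    var = estimates.var(axis=0)                    # divisor R, so bias2 + var = mse
    mse = ((estimates - truth) ** 2).mean(axis=0)
    return bias2.mean(), var.mean(), mse.mean()
```

With the variance computed using divisor $R$, the three averages satisfy Bias$^2$ + Var = MSE exactly, up to floating-point rounding.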

Table 1
Results of Monte Carlo experiments

a_n     h       Bias^2    Var       MSE
0.05    0.10    0.0039    0.0321    0.0361
        0.20    0.0065    0.0162    0.0227
        0.30    0.0262    0.0119    0.0381
        0.40    0.0525    0.0087    0.0612
0.10    0.10    0.0118    0.0221    0.0339
        0.20    0.0105    0.0115    0.0215
        0.30    0.0141    0.0078    0.0219
        0.40    0.0263    0.0062    0.0325
0.15    0.10    0.0224    0.0190    0.0414
        0.20    0.0165    0.0098    0.0263
        0.30    0.0149    0.0063    0.0212
        0.40    0.0220    0.0049    0.0269
0.20    0.10    0.0335    0.0174    0.0508
        0.20    0.0268    0.0081    0.0349
        0.30    0.0214    0.0058    0.0272
        0.40    0.0252    0.0044    0.0295

Results are illustrated graphically in Figure 2 for the case $h = 0.2$ and
$a_n = 0.1$. The figure shows $g(x)$ (solid line), the Monte Carlo approximation
to $E\{\hat g(x)\}$ (dashed line) and a 95% pointwise "estimation band." The band
connects the points $g(x_j) \pm \delta_j$, for $j = 1, \ldots, 19$, where each $\delta_j$ is chosen so
that the interval $[g(x_j) - \delta_j, g(x_j) + \delta_j]$ contains 95% of the 1000 simulated
values of $\hat g(x_j)$. The figure shows, not surprisingly, that $\hat g$ is somewhat biased,
but that the shape of $E\hat g$ is similar to that of $g$.
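
One way to obtain the half-widths $\delta_j$ from the simulated values is sketched below. This is an assumption about the computation, which the text does not spell out: for each point, $\delta_j$ is taken to be the smallest order statistic of $|\hat g(x_j) - g(x_j)|$ for which at least the stated fraction of replications lies inside the interval. The function name is hypothetical.

```python
import numpy as np

def estimation_band_halfwidth(estimates, truth, level=0.95):
    """estimates: (R, P) simulated values of g_hat; truth: (P,) values of g.
    Returns delta of shape (P,) such that at least `level` of the R values
    at each point satisfy |g_hat - g| <= delta."""
    dev = np.sort(np.abs(np.asarray(estimates) - np.asarray(truth)), axis=0)
    R = dev.shape[0]
    k = int(np.ceil(level * R)) - 1     # index of the ceil(level * R)-th smallest deviation
    return dev[k]
```

The band is then `truth - delta` to `truth + delta`; it is centred on $g$ rather than on $E\hat g$, so it widens where the estimator is biased.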

6. Technical arguments. 

6.1. Proof of Theorem 4.1. (The "big oh" bounds that we shall derive
below apply uniformly in $G \in \mathcal{G}$, although for the sake of simplicity we shall
not make this qualification.) Put $T^{+} = (T + a_n I)^{-1}$, let $\|\cdot\|$ denote the usual
$L_2$ norm for functions from the interval $[0,1]$ to the real line, and given a
functional $\chi$ from $L_2[0,1]$ to itself, set

$\|\chi\| = \sup_{\psi \in L_2[0,1]:\ \|\psi\| = 1} \|\chi\psi\|.$

For future reference, we note that A.3 and A.4 imply that

(6.1) $n^{\{1/(2\beta+\alpha)\}-1} a_n^{-1} + h^{2r} a_n^{-2} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Define

$D_n(z) = \int\!\!\int g(x)\,f_{XW}(x,w)\,T^{+}(\hat f_{XW} - f_{XW})(z,w)\,dx\,dw,$

$A_{n1}(z) = \frac{1}{n}\sum_{i=1}^{n} (T^{+}f_{XW})(z, W_i)\,Y_i,$

$A_{n2}(z) = \frac{1}{n}\sum_{i=1}^{n} \{T^{+}(\hat f_{XW}^{(-i)} - f_{XW})\}(z, W_i)\,Y_i - D_n(z),$

$A_{n3}(z) = \frac{1}{n}\sum_{i=1}^{n} \{(\hat T^{+} - T^{+})f_{XW}\}(z, W_i)\,Y_i + D_n(z),$

$A_{n4}(z) = \frac{1}{n}\sum_{i=1}^{n} \{(\hat T^{+} - T^{+})(\hat f_{XW}^{(-i)} - f_{XW})\}(z, W_i)\,Y_i.$

Then $\hat g = A_{n1} + \cdots + A_{n4}$ and so the theorem will follow if we prove that

(6.2) $E\|A_{n1} - g\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}),$

(6.3) $E\|A_{nj}\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)})$ for $j = 2, 3, 4$.

To derive (6.2), note that $EA_{n1} - g = -a_n \sum_{j \ge 1} b_j(\lambda_j + a_n)^{-1}\phi_j$. Therefore,

$\|EA_{n1} - g\|^2 = a_n^2 \sum_{j=1}^{\infty} \frac{b_j^2}{(\lambda_j + a_n)^2}.$

Divide the last-written series up into the sum over $j \le J \equiv a_n^{-1/\alpha}$, and
the complementary part, thereby bounding the right-hand side by
$a_n^2 \sum_{j \le J} (b_j/\lambda_j)^2 + \sum_{j > J} b_j^2$; and use A.3 and A.4 to bound each of these
terms, hence proving that

(6.4) $\|EA_{n1} - g\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$
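
The two-term bound can be made explicit as follows. This is a sketch under the inequalities of A.3 as read here, namely $|b_j| \le Cj^{-\beta}$ and $j^{-\alpha} \le C\lambda_j$, with $J = a_n^{-1/\alpha}$; constants are not tracked carefully. It uses $a_n/(\lambda_j + a_n) \le a_n/\lambda_j$ for $j \le J$ and $a_n/(\lambda_j + a_n) \le 1$ for $j > J$.

```latex
\begin{align*}
a_n^2 \sum_{j \le J} (b_j/\lambda_j)^2
  &\le C^4 a_n^2 \sum_{j \le J} j^{2\alpha - 2\beta}
   = O\bigl(a_n^2 J^{2\alpha - 2\beta + 1}\bigr)
   = O\bigl(a_n^{(2\beta - 1)/\alpha}\bigr)
   && \text{(using } 2\alpha - 2\beta + 1 > 0\text{)}, \\
\sum_{j > J} b_j^2
  &\le C^2 \sum_{j > J} j^{-2\beta}
   = O\bigl(J^{-(2\beta - 1)}\bigr)
   = O\bigl(a_n^{(2\beta - 1)/\alpha}\bigr)
   && \text{(using } 2\beta > 1\text{)}, \\
a_n^{(2\beta - 1)/\alpha}
  &\asymp n^{-(2\beta - 1)/(2\beta + \alpha)}
   && \text{(using } a_n \asymp n^{-\alpha/(2\beta + \alpha)}\text{)}.
\end{align*}
```

The first line is where the lower bound $\alpha > \beta - \frac{1}{2}$ enters: it guarantees that the partial sum grows polynomially in $J$.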

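Using A.2, we deduce that

$n\,\mathrm{var}\{A_{n1}(z)\} \le E[\{(T^{+}f_{XW})(z,W)\,Y\}^2]$
$\qquad = E[\{(T^{+}f_{XW})(z,W)\}^2\,E(Y^2 \mid W)]$
$\qquad \le \text{const.}\,B_n,$

where $B_n = E[\{(T^{+}f_{XW})(z,W)\}^2]$ and, here and below, "const." will denote
a positive constant, different at different appearances. It can be proved, from
an expansion of $T^{+}f_{XW}(z,w)$ in its generalized Fourier series, that

$\int_0^1 B_n\,dz = \sum_{j=1}^{\infty}\sum_{k=1}^{\infty}\sum_{\ell=1}^{\infty} \frac{d_{jk}\,d_{j\ell}}{(\lambda_j + a_n)^2} \int \phi_k\,\phi_\ell\,f_W$
$\qquad \le \text{const.} \sum_{j=1}^{\infty}\sum_{k=1}^{\infty}\sum_{\ell=1}^{\infty} \frac{|d_{jk}\,d_{j\ell}|}{(\lambda_j + a_n)^2}$
$\qquad \le \text{const.} \sum_{j=1}^{\infty} \frac{\lambda_j}{(\lambda_j + a_n)^2}.$

Therefore,

$\int_0^1 \mathrm{var}\{A_{n1}(z)\}\,dz \le \text{const.}\ \frac{1}{n} \sum_{j=1}^{\infty} \frac{\lambda_j}{(\lambda_j + a_n)^2}.$

From this point, using the argument leading to (6.4), we may prove that

$E\|A_{n1} - EA_{n1}\|^2 = \int \mathrm{var}\{A_{n1}(z)\}\,dz$
$\qquad = O(n^{-1} a_n^{-(\alpha+1)/\alpha})$
$\qquad = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Result (6.2) is implied by this bound and (6.4).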
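
The order of the variance sum can be checked along the same lines as the bias. The sketch below assumes, in addition to $j^{-\alpha} \le C\lambda_j$, the upper bound $\lambda_j \le C^2 j^{-\alpha}$; this follows from (4.1) and A.3 because $\lambda_j = \sum_k d_{jk}^2 \le (\sum_k |d_{jk}|)^2$. As before, $J = a_n^{-1/\alpha}$.

```latex
\begin{align*}
\sum_{j \le J} \frac{\lambda_j}{(\lambda_j + a_n)^2}
  &\le \sum_{j \le J} \lambda_j^{-1}
   \le C \sum_{j \le J} j^{\alpha}
   = O\bigl(J^{\alpha + 1}\bigr)
   = O\bigl(a_n^{-(\alpha + 1)/\alpha}\bigr), \\
\sum_{j > J} \frac{\lambda_j}{(\lambda_j + a_n)^2}
  &\le a_n^{-2} \sum_{j > J} \lambda_j
   \le C^2 a_n^{-2} \sum_{j > J} j^{-\alpha}
   = O\bigl(a_n^{-2} J^{-(\alpha - 1)}\bigr)
   = O\bigl(a_n^{-(\alpha + 1)/\alpha}\bigr), \\
n^{-1} a_n^{-(\alpha + 1)/\alpha}
  &\asymp n^{-1 + (\alpha + 1)/(2\beta + \alpha)}
   = n^{-(2\beta - 1)/(2\beta + \alpha)}.
\end{align*}
```

The tail sum converges because $\alpha > 1$, and the final line shows that the variance and squared-bias contributions are balanced by the choice of $a_n$ in A.4.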

Next we derive (6.3) in the case $j = 2$. Here and below, given a bivariate
function $\phi(z,w)$, put $\phi_w(z) = \phi(z,w)$ and define $T^{+}\phi(z,w) = (T^{+}\phi_w)(z)$.
Let

$D_{ni}(z) = \int\!\!\int g(x)\,f_{XW}(x,w)\,T^{+}(\hat f_{XW}^{(-i)} - f_{XW})(z,w)\,dx\,dw,$

$A_{n21}(z) = \frac{1}{n}\sum_{i=1}^{n} \{T^{+}(\hat f_{XW}^{(-i)} - f_{XW})(z, W_i)\,Y_i - D_{ni}(z)\},$

$A_{n22}(z) = \frac{1}{n}\sum_{i=1}^{n} \{D_{ni}(z) - D_n(z)\},$

in which notation $A_{n2} = A_{n21} + A_{n22}$. Write $\int A_{n21}(z)^2\,dz$ as a double
series, and take the expected values of the terms one by one. It may be
shown by tedious calculation that the total contribution of the terms equals
$O\{h^{2r}(na_n^2)^{-1} + (nha_n)^{-2}\}$. Therefore,

(6.5) $E\|A_{n21}\|^2 = O\{h^{2r}(na_n^2)^{-1} + (nha_n)^{-2}\} = o(n^{-(2\beta-1)/(2\beta+\alpha)}),$

where we used (6.1) to obtain the second identity. Furthermore,

$A_{n22}(z) = -n^{-1} \int\!\!\int g(x)\,f_{XW}(x,w)\,T^{+}\hat f_{XW}(z,w)\,dx\,dw,$

from which, noting (6.1), it may be deduced that

$E\|A_{n22}\|^2 \le \text{const.}\,(na_n)^{-2}\,E\Bigl(\int |g\,f_{XW}\,\hat f_{XW}|\Bigr)^2$
$\qquad = O\{(na_n)^{-2}\} = o(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Property (6.3), in the case $j = 2$, follows from this result and (6.5).

Next we derive (6.3) for $j = 3$. Define $\Delta = \hat T - T$, an operator, and put

$A_{n31} = -(I + T^{+}\Delta)^{-1}T^{+}\Delta g + D_n, \qquad A_{n32} = -(I + T^{+}\Delta)^{-1}T^{+}\Delta(A_{n1} - g).$

Noting that $\hat T^{+} - T^{+} = -(I + T^{+}\Delta)^{-1}T^{+}\Delta T^{+}$, it can be seen that
$A_{n3} = A_{n31} + A_{n32}$.
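
The operator identity used here is purely algebraic and can be verified numerically in finite dimensions. The sketch below uses random symmetric matrices as stand-ins for $T$ and $\Delta$; the function name, dimensions and scales are arbitrary choices made for illustration.

```python
import numpy as np

def resolvent_identity_gap(seed=0, m=6, a_n=0.1):
    """Largest absolute difference between T_hat^+ - T^+ and
    -(I + T^+ Delta)^{-1} T^+ Delta T^+ for random matrix stand-ins."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(m, m))
    T = A @ A.T / m                      # symmetric, nonnegative definite
    E = rng.normal(size=(m, m))
    Delta = 0.001 * (E + E.T)            # small symmetric perturbation, T_hat = T + Delta
    I = np.eye(m)
    T_plus = np.linalg.inv(T + a_n * I)
    T_hat_plus = np.linalg.inv(T + Delta + a_n * I)
    rhs = -np.linalg.inv(I + T_plus @ Delta) @ T_plus @ Delta @ T_plus
    return np.max(np.abs((T_hat_plus - T_plus) - rhs))

print(resolvent_identity_gap())
```

The identity follows from writing $\hat T + a_n I = (T + a_n I)(I + T^{+}\Delta)$, so that $\hat T^{+} = (I + T^{+}\Delta)^{-1}T^{+}$.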

Let $\delta = h^{2r} + (nh)^{-1}$. Using standard, but tedious, moment calculations,
it may be proved that $E(\hat t - t)^{2k} = O(\delta^k)$ for each integer $k \ge 1$, uniformly in
the argument of $\hat t - t$. [The quantity $\delta$ involves $(nh)^{-1}$, rather than $(nh^2)^{-1}$,
since the integral in the definition of $\hat t$ effectively removes one of the
factors $h^{-1}$.] Therefore, since $\|\Delta\|^2 = \int\!\!\int (\hat t - t)^2$, then for each integer $k \ge 1$,

(6.6) $E\|\Delta\|^{2k} = O(\delta^k).$

At the end of this proof we shall show that, for each $k \ge 1$,

(6.7) $E\{\|(I + T^{+}\Delta)^{-1}\|^k\} = O(1)$

as $n \to \infty$. Hence, using the Cauchy-Schwarz inequality,

(6.8) $\{E\|(I + T^{+}\Delta)^{-1}T^{+}\Delta\|^4\}^2 \le E\|(I + T^{+}\Delta)^{-1}\|^8\,\|T^{+}\|^8\,E\|\Delta\|^8 = O(\delta^4/a_n^8).$

From this result, and the Cauchy-Schwarz inequality again, we obtain

(6.9) $E\|A_{n32}\|^2 \le \{E\|(I + T^{+}\Delta)^{-1}T^{+}\Delta\|^4\,E\|A_{n1} - g\|^4\}^{1/2}$
$\qquad = O\{(\delta/a_n^2)\,(E\|A_{n1} - g\|^4)^{1/2}\}$
$\qquad = O(n^{-(2\beta-1)/(2\beta+\alpha)}),$

the final identity following using an argument similar to that leading to (6.2).
Put

$B_{n1}(z) = \int\!\!\int \{\hat f_{XW}(x,w) - f_{XW}(x,w)\}\,f_{XW}(z,w)\,g(x)\,dx\,dw,$

$B_{n2}(z) = \int\!\!\int \{\hat f_{XW}(z,w) - f_{XW}(z,w)\}\,f_{XW}(x,w)\,g(x)\,dx\,dw,$

$B_{n3}(z) = \int\!\!\int \{\hat f_{XW}(x,w) - f_{XW}(x,w)\}\{\hat f_{XW}(z,w) - f_{XW}(z,w)\}\,g(x)\,dx\,dw,$

$B_{n11}(z) = \int\!\!\int \{E\hat f_{XW}(x,w) - f_{XW}(x,w)\}\,f_{XW}(z,w)\,g(x)\,dx\,dw,$

$B_{n12}(z) = \int\!\!\int \{\hat f_{XW}(x,w) - E\hat f_{XW}(x,w)\}\,f_{XW}(z,w)\,g(x)\,dx\,dw,$

$B_{n21}(z) = \int\!\!\int \{E\hat f_{XW}(z,w) - f_{XW}(z,w)\}\,f_{XW}(x,w)\,g(x)\,dx\,dw,$

$B_{n22}(z) = \int\!\!\int \{\hat f_{XW}(z,w) - E\hat f_{XW}(z,w)\}\,f_{XW}(x,w)\,g(x)\,dx\,dw.$

In this notation, $\Delta g = B_{n1} + B_{n2} + B_{n3}$, $B_{n1} = B_{n11} + B_{n12}$, $B_{n2} = B_{n21} +
B_{n22}$ and $T^{+}B_{n2} = D_n$, whence

$A_{n31} = -(I + T^{+}\Delta)^{-1}T^{+}(B_{n11} + B_{n12} + B_{n3}) + (I + T^{+}\Delta)^{-1}T^{+}\Delta T^{+}(B_{n21} + B_{n22}).$

Define

$\tilde A_{n31} = -(I + T^{+}\Delta)^{-1}T^{+}(B_{n11} + B_{n12} + B_{n3}) + (I + T^{+}\Delta)^{-1}T^{+}\Delta T^{+}B_{n21}.$

Then

$E\|A_{n31}\|^2 \le \text{const.}\{E\|\tilde A_{n31}\|^2 + E\|(I + T^{+}\Delta)^{-1}T^{+}\Delta T^{+}B_{n22}\|^2\}.$

By (6.7) and the Cauchy-Schwarz inequality,

(6.10) $E\|\tilde A_{n31}\|^2 \le \text{const.}\,(\|T^{+}B_{n11}\|^4 + E\|T^{+}B_{n12}\|^4 + E\|T^{+}\Delta T^{+}B_{n21}\|^4 + E\|T^{+}B_{n3}\|^4)^{1/2}.$

Since $\|B_{n11}\| + \|B_{n21}\| = O(h^r)$ and $\|T^{+}\| = O(a_n^{-1})$, then, by (6.1),

(6.11) $\|T^{+}B_{n11}\| + \|T^{+}B_{n21}\| \le \|T^{+}\|(\|B_{n11}\| + \|B_{n21}\|) = O(h^r a_n^{-1}) = O(n^{-(2\beta-1)/\{2(2\beta+\alpha)\}}).$
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Furthermore, with 

A jk = / {fxw(x,w) - Efxw(x,w)}4>j(x)(j)k(x)dxdw, 



we have 



OO OO OO 7 7 A 



i=ik=u=i A i + a « 

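Now $E(\Delta_{j_1k_1}\Delta_{\ell_1m_1}\Delta_{j_2k_2}\Delta_{\ell_2m_2}) = O(n^{-2})$, uniformly in the indicated indices; $\sum_\ell |b_\ell| < \infty$, since A.3 implies that $\beta > 1$; and $\sum_{k\ge1}|d_{jk}| = O(j^{-\alpha/2})$, again by A.3. Therefore,

(6.12) $(E\|T^+B_{n12}\|^4)^{1/2} = O\Big\{n^{-1}\sum_{j=1}^\infty \frac{1}{(\lambda_j + a_n)^2}\Big(\sum_{k=1}^\infty\sum_{\ell=1}^\infty |d_{jk}|\,|b_\ell|\Big)^2\Big\}$
$= O\Big\{n^{-1}\sum_{j=1}^\infty \frac{j^{-\alpha}}{(\lambda_j + a_n)^2}\Big\} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$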
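Rates of order $n^{-(2\beta-1)/(2\beta+\alpha)}$ in this proof come from sums of the type $n^{-1}\sum_j j^{-\alpha}/(\lambda_j + a_n)^2$. The following is a minimal numerical sketch of that balance, not part of the argument. It assumes the specific choices $\lambda_j = j^{-\alpha}$ and $a_n = n^{-\alpha/(2\beta+\alpha)}$ (the proof needs only these orders of magnitude), truncates the infinite sum, and uses hypothetical function names.

```python
import numpy as np

def weighted_sum(n, alpha, beta, terms=200_000):
    """n^{-1} * sum_j j^{-alpha} / (lambda_j + a_n)^2, truncated at `terms`.

    Assumes lambda_j = j^{-alpha} and a_n = n^{-alpha/(2*beta+alpha)};
    both are illustrative choices, not values fixed by the paper.
    """
    j = np.arange(1, terms + 1, dtype=float)
    lam = j ** (-alpha)
    a_n = n ** (-alpha / (2 * beta + alpha))
    return np.sum(j ** (-alpha) / (lam + a_n) ** 2) / n

def target_rate(n, alpha, beta):
    """The rate n^{-(2*beta-1)/(2*beta+alpha)}."""
    return n ** (-(2 * beta - 1) / (2 * beta + alpha))

alpha, beta = 2.0, 2.0
ratios = [weighted_sum(n, alpha, beta) / target_rate(n, alpha, beta)
          for n in (10**3, 10**4, 10**5, 10**6)]
print(ratios)  # the ratios should stay bounded as n grows
```

If the sum really is of the stated order, the printed ratios neither drift to zero nor blow up across the three decades of $n$.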

In view of (6.1) and (6.6),

(6.13) $E\|T^+\Delta\|^8 \le \|T^+\|^8E\|\Delta\|^8 = O(a_n^{-8}E\|\Delta\|^8) = O(\delta^4/a_n^8) = O(1).$

By (6.11), (6.13) and the Cauchy-Schwarz inequality,

(6.14) $(E\|T^+\Delta T^+B_{n21}\|^4)^{1/2} \le (E\|T^+\Delta\|^8\,E\|T^+B_{n21}\|^8)^{1/4} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Define

$I_n(w) = \int \{\hat f_{XW}(x,w) - f_{XW}(x,w)\}\,g(x)\,dx,$

$J_n = \int\!\!\int \{T^+(\hat f_{XW} - f_{XW})(z,w)\}^2\,dw\,dz.$

Moment calculations show that $E\|I_n\|^8 = O(\delta^4)$ and $E(J_n^4) = O(\delta^4/a_n^8)$, and so by the Cauchy-Schwarz inequality,

(6.15) $(E\|T^+B_{n3}\|^4)^{1/2} \le \{E(\|I_n\|^4J_n^2)\}^{1/2} \le (E\|I_n\|^8\,EJ_n^4)^{1/4} = O(\delta^2/a_n^2) = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

It follows from (6.10)-(6.12), (6.14) and (6.15) that

(6.16) $E\|\tilde A_{n31}\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$
Now consider

(6.17) $(I + T^+\Delta)^{-1}T^+\Delta T^+B_{n22}$
$= I(\|T^+\Delta\| \le \tfrac12)\,(I + T^+\Delta)^{-1}T^+\Delta T^+B_{n22} + I(\|T^+\Delta\| > \tfrac12)\,(I + T^+\Delta)^{-1}T^+\Delta T^+B_{n22}$
$= H_{n1} + H_{n2},$

say. We first investigate $H_{n1}$.

If $\|T^+\Delta\| \le \tfrac12$, then for some constant $D$ not depending on $\psi$, $\|(I + T^+\Delta)^{-1}\psi\|^2 \le D\|\psi\|^2$. Therefore, $\|H_{n1}\|^2 \le D\|T^+\Delta T^+B_{n22}\|^2$. Some algebra shows that

$T^+\Delta T^+B_{n22}(z) = R_{n1}(z) + R_{n2}(z) + R_{n3}(z),$

where 

$R_{n1}(z) = \int t^+(z,u)\{\hat f_{XW}(x,w_1) - f_{XW}(x,w_1)\}\,f_{XW}(u,w_1)$
$\qquad\times\ t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,du\,dv\,dx\,dw_1\,dw_2,$

$R_{n2}(z) = \int t^+(z,u)\{\hat f_{XW}(u,w_1) - f_{XW}(u,w_1)\}\,f_{XW}(x,w_1)$
$\qquad\times\ t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,du\,dv\,dx\,dw_1\,dw_2,$

$R_{n3}(z) = \int t^+(z,u)\{\hat f_{XW}(u,w_1) - f_{XW}(u,w_1)\}\{\hat f_{XW}(x,w_1) - f_{XW}(x,w_1)\}$
$\qquad\times\ t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,du\,dv\,dx\,dw_1\,dw_2.$

First we treat $R_{n1}$. Write $R_{n1} = R_{n11} + R_{n12}$, where

$R_{n11}(z) = \int t^+(z,u)\{\hat f_{XW}(x,w_1) - E\hat f_{XW}(x,w_1)\}\,f_{XW}(u,w_1)$
$\qquad\times\ t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,du\,dv\,dx\,dw_1\,dw_2,$

$R_{n12}(z) = \int t^+(z,u)\{E\hat f_{XW}(x,w_1) - f_{XW}(x,w_1)\}\,f_{XW}(u,w_1)$
$\qquad\times\ t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,du\,dv\,dx\,dw_1\,dw_2.$
By the Cauchy-Schwarz inequality,

$\|R_{n11}\|^2 \le \int\!\!\int \Big[\int\!\!\int t^+(z,u)\{\hat f_{XW}(x,w_1) - E\hat f_{XW}(x,w_1)\}\,f_{XW}(u,w_1)\,du\,dw_1\Big]^2 dx\,dz$
$\qquad\times \int \Big[\int\!\!\int t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,dv\,dw_2\Big]^2 dx$
$= A_{n1}A_{n2},$

say. Further application of the Cauchy-Schwarz inequality gives

(6.18) $E\|R_{n11}\|^2 \le \{(EA_{n1}^2)(EA_{n2}^2)\}^{1/2}.$

Also,

(6.19) $(EA_{n1}^2)^{1/2} = O\{(nha_n^2)^{-1}\}.$

Now define $\delta_k(x) = \int \{\hat f_{XW}(x,w) - E\hat f_{XW}(x,w)\}\,\phi_k(w)\,dw$. Then



OO OO OO J 1 

--lk=li=l ^ +a »^ 



J 

from which it follows that 

^ (A " i)=o {^i£§ 

$= O(h^{-1}n^{-(2\beta-1)/(2\beta+\alpha)}).$

Combining this result with (6.18) and (6.19), we obtain

(6.20) $E\|R_{n11}\|^2 = O\Big(\frac{1}{nh^2a_n^2}\,n^{-(2\beta-1)/(2\beta+\alpha)}\Big) = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Calculations in the case of $R_{n12}$ are similar, as follows. We re-define

$\delta_k(x) = \int \{E\hat f_{XW}(x,w) - f_{XW}(x,w)\}\,\phi_k(w)\,dw = O(h^r).$



Therefore, 

,2r- 1 



V a£ J 



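Combining this result with (6.20), we deduce that

(6.21) $E\|R_{n1}\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Next we treat $R_{n2}$. Re-define $A_{n1}$ and $A_{n2}$ by

$R_{n2}(z)^2 \le \int \Big[\int t^+(z,u)\{\hat f_{XW}(u,w_1) - f_{XW}(u,w_1)\}\,du\Big]^2 dw_1$
$\qquad\times \int \Big[\int f_{XW}(x,w_1)\,t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,dv\,dx\,dw_2\Big]^2 dw_1$
$= A_{n1}(z)A_{n2}(z).$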
Furthermore,

(6.22) mAnlf y» = o(^ + ^ 

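Defining $\delta_{jk} = \int \{\hat f_{XW}(x,w) - E\hat f_{XW}(x,w)\}\,\phi_j(x)\phi_k(w)\,dx\,dw$ and $h_j = \int H\phi_j$, we have

$A_{n2} = \sum_{k=1}^\infty\Big(\sum_{j=1}^\infty\sum_{\ell=1}^\infty \frac{d_{jk}\,\delta_{j\ell}\,h_\ell}{\lambda_j + a_n}\Big)^2 = \sum_{j=1}^\infty\sum_{\ell=1}^\infty\sum_{s=1}^\infty \frac{\lambda_j\,\delta_{j\ell}\,\delta_{js}\,h_\ell h_s}{(\lambda_j + a_n)^2}.$

Therefore,

$(E\|A_{n2}\|^2)^{1/2} = O\Big\{n^{-1}\sum_{j=1}^\infty \frac{\lambda_j}{(\lambda_j + a_n)^2}\Big\} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$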

This result and (6.22) give

(6.23) $E\|R_{n2}\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Next we treat $R_{n3}$. Note that $R_{n3}(z)^2 \le A_{n1}(z)A_{n2}(z)$, where we re-define

$A_{n1}(z) = \int \Big[\int\!\!\int t^+(z,u)\{\hat f_{XW}(u,w_1) - f_{XW}(u,w_1)\}\{\hat f_{XW}(x,w_1) - f_{XW}(x,w_1)\}\,du\,dw_1\Big]^2 dx,$

$A_{n2}(z) = \int \Big[\int\!\!\int t^+(x,v)\{\hat f_{XW}(v,w_2) - E\hat f_{XW}(v,w_2)\}\,H(w_2)\,dv\,dw_2\Big]^2 dx.$

Therefore,


E WRm\\ 2 = Of^ir + -fa) = 0(^ (2 ^ 1)/(2/3+Q) ). 

Combining this result with (6.21) and (6.23), and recalling the definition of $H_{n1}$ at (6.17), we deduce that

(6.24) $E\|H_{n1}\|^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Now we consider $H_{n2}$. We have

$\|(I + T^+\Delta)^{-1}\psi\| = \|\hat T^+(T + a_nI)\psi\| \le \|\hat T^+\|\,\|T + a_nI\|\,\|\psi\| \le \text{const.}\,a_n^{-1}\|\psi\|.$

Therefore,

$\|H_{n2}\|^2 \le \text{const.}\,a_n^{-2}\,I(\|T^+\Delta\| > \tfrac12)\,\|T^+\Delta T^+B_{n22}\|^2,$

and so by the Cauchy-Schwarz inequality,

$E\|H_{n2}\|^2 \le \text{const.}\,a_n^{-2}\,P(\|T^+\Delta\| > \tfrac12)^{1/2}\,(E\|T^+\Delta T^+B_{n22}\|^4)^{1/2}.$

We shall prove shortly that, for all $\ell > 0$,

(6.25) $P(\|T^+\Delta\| > \tfrac12) = O\{(\delta/a_n^2)^\ell\}.$

Moreover,

(6.26) $E\|T^+B_{n22}\|^8 \le \|T^+\|^8E\|B_{n22}\|^8 = O(a_n^{-8}E\|B_{n22}\|^8) = O\Big\{a_n^{-8}\Big(E\int B_{n22}^2\Big)^4\Big\} = O\{(nha_n^2)^{-4}\},$

the last identity following by moment calculations similar to those leading to (6.6). Combining (6.13) and (6.26), and applying the Cauchy-Schwarz inequality, we deduce that

$(E\|T^+\Delta T^+B_{n22}\|^4)^{1/2} \le (E\|T^+\Delta\|^8\,E\|T^+B_{n22}\|^8)^{1/4} = O\{(\delta/a_n^2)(nha_n^2)^{-1}\}.$

Using this result together with (6.25), and choosing $\ell$ sufficiently large, we obtain

$E\|H_{n2}\|^2 = O\{a_n^{-2}(\delta/a_n^2)^{1+\ell/2}(nha_n^2)^{-1}\} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Combining this result with (6.17) and (6.24), we obtain

$E\|(I + T^+\Delta)^{-1}T^+\Delta T^+B_{n22}\|^2 = O(E\|H_{n1}\|^2 + E\|H_{n2}\|^2) = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Result (6.3) for $j = 3$ follows from this formula and (6.16).

Next we derive (6.3) for $j = 4$. Since $\hat T^+ - T^+ = -(I + T^+\Delta)^{-1}T^+\Delta T^+$ and $I - \hat T^+T = -(I + T^+\Delta)^{-1}T^+\Delta$, then

$A_{n4} = -(I + T^+\Delta)^{-1}T^+\Delta(A_{n2} - T^+B_{n2}).$

The arguments leading to (6.3) with $j = 2$, and (6.15), may be used to prove that

$r_1 = \{(\delta^2/a_n^4)\,E\|A_{n2}\|^4 + E\|T^+\Delta T^+B_{n2}\|^4\}^{1/2} = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

Therefore, by (6.7), (6.8) and the Cauchy-Schwarz inequality,

$E\|A_{n4}\|^2 \le 2\{E\|(I + T^+\Delta)^{-1}T^+\Delta\|^4\,E\|A_{n2}\|^4\}^{1/2} + 2\{E\|(I + T^+\Delta)^{-1}\|^4\,E\|T^+\Delta T^+B_{n2}\|^4\}^{1/2}$
$= O(r_1) = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$

This proves (6.3) for $j = 4$.
It remains to derive (6.7). Let $\psi \in L_2[0,1]$. Then, for constants not depending on $\psi$, if $\|T^+\Delta\| \le \tfrac12$,

$\|(I + T^+\Delta)^{-1}\psi\| \le \text{const.}\,\|\psi\|$

and, without any constraint on $\|T^+\Delta\|$,

$\|(I + T^+\Delta)^{-1}\psi\| = \|\hat T^+(T + a_nI)\psi\| \le \|\hat T^+\|\,\|T + a_nI\|\,\|\psi\| \le \text{const.}\,a_n^{-1}\|\psi\|.$

Therefore,

$\|(I + T^+\Delta)^{-1}\| \le \text{const.}\,\{1 + a_n^{-1}I(\|T^+\Delta\| > \tfrac12)\}.$

Hence, noting (6.6), and employing Markov's inequality to bound $P(\|T^+\Delta\| > \tfrac12)$, we deduce that, for each fixed $k, \ell > 0$,

(6.27) $E\{\|(I + T^+\Delta)^{-1}\|^k\} \le \text{const.}\,\{1 + a_n^{-k}P(\|T^+\Delta\| > \tfrac12)\}$
$\le \text{const.}\,\{1 + a_n^{-k}E(\|T^+\Delta\|^{2\ell})\}$
$\le \text{const.}\,\{1 + a_n^{-k-2\ell}E(\|\Delta\|^{2\ell})\}$
$\le \text{const.}\,(1 + a_n^{-k-2\ell}\delta^\ell)$
$= \text{const.}\,\{1 + a_n^{-k}(\delta/a_n^2)^\ell\},$

where the constants depend on $k$ and $\ell$ but not on $n$. If $k$ is given, then we may choose $\ell = \ell(k)$ so large that $a_n^{-k}(\delta/a_n^2)^\ell \to 0$ as $n \to \infty$, and so (6.7) follows from (6.27). This argument also gives (6.25).
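The two operator bounds used in this argument can be checked in a finite-dimensional analogue. The sketch below is illustrative only: it replaces $T$ and $\hat T$ by random positive semi-definite matrices (an assumption made purely for the demonstration), takes a hypothetical dimension and ridge value, and verifies the identity $(I + T^+\Delta)^{-1} = \hat T^+(T + a_nI)$, the crude bound $a_n^{-1}\|T + a_nI\|$, and the Neumann-series bound that applies once $\|T^+\Delta\| \le \tfrac12$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, a_n = 6, 0.05          # hypothetical dimension and ridge parameter
I = np.eye(d)

def random_psd():
    """Random symmetric positive semi-definite matrix (stand-in for T or T-hat)."""
    b = rng.normal(size=(d, d))
    return b @ b.T / d

def op_norm(a):
    """Operator (spectral) norm of a matrix."""
    return np.linalg.norm(a, 2)

T, T_hat = random_psd(), random_psd()
Delta = T_hat - T
T_plus = np.linalg.inv(T + a_n * I)          # T^+ = (T + a_n I)^{-1}
T_hat_plus = np.linalg.inv(T_hat + a_n * I)  # same construction for T-hat

# Identity used in the proof: (I + T^+ Delta)^{-1} = T-hat^+ (T + a_n I).
inv_term = np.linalg.inv(I + T_plus @ Delta)
identity_gap = np.max(np.abs(inv_term - T_hat_plus @ (T + a_n * I)))

# Crude bound, valid with no constraint on ||T^+ Delta||.
crude_bound = op_norm(T + a_n * I) / a_n

# Rescale Delta so that ||T^+ Delta|| = 1/2; a Neumann series then gives norm <= 2.
Delta_small = Delta * (0.5 / op_norm(T_plus @ Delta))
small_norm = op_norm(np.linalg.inv(I + T_plus @ Delta_small))

print(identity_gap, op_norm(inv_term), crude_bound, small_norm)
```

The crude bound grows like $a_n^{-1}$, which is why the proof pays for it with the small probability $P(\|T^+\Delta\| > \tfrac12)$ rather than using it everywhere.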

6.2. Proof of Theorem 4.2. Put $\bar p = (p_1, \ldots, p_m)'$, where $p_j = E_G\{g(X)\chi_j(W)\} = E_G\{Y\chi_j(W)\}$. Let $\gamma = (\gamma_j)$ and $p = (p_j)$ denote infinite column vectors, and let $\bar Q$ be the $m \times m$ upper left-hand sub-matrix of $Q$. Since $p = Q\gamma$, then $p_j = p_j(G) = O(j^{-(2\beta+\alpha)/2})$, uniformly in $G \in \mathcal{H}$, as $j \to \infty$. Therefore, $(\bar Q'\bar p)_i = O(i^{-(\alpha+\beta)})$, uniformly in $1 \le i \le m$, $n \ge 1$ and $G \in \mathcal{H}$. This result will be used below without further reference.

Put M = QQ' + a n I m and M = QQ' + a n I m . It may be deduced from 
the definition of Ti that the bounds on \qjk\ and \qj k \ in that definition 
apply too to the (j, k)th elements of M and M , respectively, provided we 
replace a by 2a and alter the constants C\ and C2 (retaining their positivity, 
of course). The bounds are valid uniformly in 1 < j,k <m and n > 1, and 
permit it to be proved that 

(M- 1 Q'p) J = {{Q'Q)~ 1 Q'p} J + 0{rrT p ) = 7j + 0(m-P), 
uniformly in $1 \le i \le m$, $n \ge 1$ and distributions $G \in \mathcal{H}$. Note too that

M _1 = {/ + hr l (M - M)} _1 M _1 , 
M^Q'p - M^Q'p = {KT 1 + (M" 1 - M- 1 )}^ - Q'p) 

+ {M~ 1 -M- l )Q'p. 
From these properties it may be shown that

^{E(7,-7,) 2 } 



(6.28) 



+ E G (jT[{M-\M - M)M-\Q'p - Q'p)}^ 
+ E G [ Y J [{^r\M-M)M- l Q'p} j } 2 )+m l - 2 P 



\i=l / ) 

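uniformly in $G \in \mathcal{H}$.
It may be proved by Taylor expansion arguments, involving approximating $\hat W_i = \hat F_W(W_i)$ by $\tilde W_i = F_W(W_i)$, and analogously for $\hat X_i$ and $\tilde X_i$, that, for each $r, \varepsilon > 0$,

(6.29) $\max_{1 \le j,k \le n^{(1/2)-\varepsilon}}\ \sup_{G \in \mathcal{H}} E_G|\hat q_{jk} - q_{jk}|^r = O(n^{-r/2}),$

(6.30) $\max_{1 \le j \le n^{(1/2)-\varepsilon}}\ \sup_{G \in \mathcal{H}} E_G(\hat p_j - p_j)^2 = O(n^{-1}).$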

Rather standard, but tedious, moment calculations, using (6.29) and (6.30), may be employed to show that each of the expected values on the right-hand side of (6.28) equals $O(n^{-1}m^{\alpha+1})$, uniformly in $G \in \mathcal{H}$. Therefore,

(6.31) $\sup_{G \in \mathcal{H}} \sum_{j=1}^m E_G\{(\hat\gamma_j - \gamma_j)^2\} = O(n^{-1}m^{\alpha+1} + m^{1-2\beta}) = O(n^{-(2\beta-1)/(2\beta+\alpha)}).$
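The last equality in (6.31) is exponent arithmetic. As a small sketch, assuming $m$ is of exact order $n^{1/(2\beta+\alpha)}$ (the choice under which the two terms balance; the function name is hypothetical), the exponents of $n$ in $n^{-1}m^{\alpha+1}$ and in $m^{1-2\beta}$ can be computed exactly and compared with the target exponent.

```python
from fractions import Fraction

def rate_exponents(alpha, beta):
    """Exponents of n in the two terms of (6.31), assuming m = n^{1/(2*beta+alpha)}.

    Returns (variance_exp, bias_exp, target_exp) as exact fractions.
    """
    alpha, beta = Fraction(alpha), Fraction(beta)
    m_exp = 1 / (2 * beta + alpha)            # exponent of n in m
    variance_exp = -1 + (alpha + 1) * m_exp   # from n^{-1} m^{alpha+1}
    bias_exp = (1 - 2 * beta) * m_exp         # from m^{1-2*beta}
    target_exp = -(2 * beta - 1) / (2 * beta + alpha)
    return variance_exp, bias_exp, target_exp

for a, b in [(2, 2), (1, Fraction(3, 2)), (3, 1)]:
    print(a, b, rate_exponents(a, b))
```

For every admissible pair the three exponents coincide, so neither term dominates at this choice of $m$.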

It follows from the definition of $\mathcal{H}$ that $\sum_{j>m}\gamma_j^2 = O(m^{1-2\beta})$, uniformly in $G \in \mathcal{H}$. This result and (6.31) imply that

$\int E_G(\hat g - g)^2 = \sum_{j=1}^m E_G(\hat\gamma_j - \gamma_j)^2 + \sum_{j=m+1}^\infty \gamma_j^2 = O(n^{-(2\beta-1)/(2\beta+\alpha)}),$

uniformly in $G \in \mathcal{H}$, completing the proof of the theorem.
6.3. Proof of Theorem 4.4. For simplicity, we deal only with the orthogonal series setting, discussed in Section 4.2. We may assume the following: $\phi_j = \chi_j$, $\phi_1 \equiv 1$ and $\phi_{j+1}(x) = 2^{1/2}\cos(j\pi x)$, for $j \ge 1$; the marginal distributions of $X$ and $W$ are uniform on the unit interval; and



i=i 

(6.32) 

2m 
j=m+l 

where $m$ equals the integer part of $n^{1/(2\beta+\alpha)}$, the $\theta_j$'s are all either 0 or 1, and $V$ is Normal $N(0,1)$, independent of $(X,W)$.

The function $g$ implied by (6.32) is $g(x) = \sum_{m+1 \le j \le 2m} \theta_jj^{-\beta}\phi_j(x)$. Note too that if $\hat g$ is an estimator of $g$, then

(6.33) $\hat\theta_j = j^\beta \int \hat g(x)\phi_j(x)\,dx$

may be viewed as an estimator of $\theta_j$.

A standard argument based on the Neyman-Pearson lemma shows that

(6.34) $\liminf_{n\to\infty}\ \inf_{m+1 \le j \le 2m}\ \inf_{\hat\theta_j}\ {\sup}^*\,E(\hat\theta_j - \theta_j)^2 > 0,$

where $\sup^*$ denotes the supremum over all $2^m$ different distributions of $(X,W,Y)$ obtained by taking different choices of $\theta_{m+1}, \ldots, \theta_{2m}$ in (6.32), and $\inf_{\hat\theta_j}$ represents the infimum over all measurable functions $\hat\theta_j$ of the data. To derive (6.34), it suffices to take $\hat\theta_j$ to be the likelihood-ratio rule for distinguishing between $\theta_j = 0$ and $\theta_j = 1$, and work through a little asymptotic theory to obtain the version of (6.34) when "$\inf_{\hat\theta_j}$" is omitted from the left-hand side.

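Therefore, if $\hat g$ is given, and $\hat\theta_{m+1}, \ldots, \hat\theta_{2m}$ are the estimators of $\theta_{m+1}, \ldots, \theta_{2m}$, respectively, derived from $\hat g$ as suggested at (6.33), then

${\sup}^* \int E(\hat g - g)^2 \ge {\sup}^* \sum_{j=m+1}^{2m} E(\hat\theta_j - \theta_j)^2\,j^{-2\beta}$
$\ge \text{const.} \sum_{j=m+1}^{2m} j^{-2\beta}$
$\ge \text{const.}\,n^{-(2\beta-1)/(2\beta+\alpha)},$

where the constants do not depend on choice of $\hat g$. This proves the theorem.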
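The final step compares the block sum $\sum_{j=m+1}^{2m} j^{-2\beta}$ with $n^{-(2\beta-1)/(2\beta+\alpha)}$. A minimal numerical sketch, with $m$ taken as the integer part of $n^{1/(2\beta+\alpha)}$ as in the proof and with illustrative values of $\alpha$ and $\beta$ (the function names are hypothetical):

```python
def block_sum(n, alpha, beta):
    """Sum of j^{-2*beta} over m < j <= 2m, with m = floor(n^{1/(2*beta+alpha)})."""
    m = int(n ** (1.0 / (2 * beta + alpha)))
    return sum(j ** (-2.0 * beta) for j in range(m + 1, 2 * m + 1))

def lower_bound_rate(n, alpha, beta):
    """The rate n^{-(2*beta-1)/(2*beta+alpha)}."""
    return n ** (-(2 * beta - 1) / (2 * beta + alpha))

alpha, beta = 2.0, 2.0   # illustrative smoothness parameters
ratios = [block_sum(n, alpha, beta) / lower_bound_rate(n, alpha, beta)
          for n in (10**3, 10**6, 10**9)]
print(ratios)  # should stay bounded away from zero
```

The ratios fluctuate because $m$ is rounded down, but they remain bounded below by a positive constant, which is all the lower bound requires.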